In this project, we study 30 stocks traded on Borsa Istanbul: their prices and the corresponding time periods. Additionally, starting from December 24th, we will produce a 10-hour price prediction for each of the 30 stocks (300 predictions in total) every day. We will evaluate these predictions with different Data Mining / Machine Learning algorithms and examine in this report which one is more efficient to use. Since the raw data contains only a few columns, we need to derive extra information from the existing features or from other sources (such as Yahoo Finance via the quantmod library).
The data consists of hourly average prices of selected stocks from Borsa İstanbul, provided as a separate csv file for each time interval and covering about 4 years in long format. We will need to handle this long format carefully, since we do not want other stocks to affect a specific stock’s price in a negative way.
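One quick sanity check for the long format is to pivot it to wide, one column per stock, so any cross-stock leakage becomes visible. A minimal sketch on hypothetical toy rows with the same three columns, using data.table:

```r
library(data.table)

# Hypothetical toy rows in the same long format: timestamp, price, short_name
toy <- data.table(
  timestamp  = rep(c("2018-01-02 09:00", "2018-01-02 10:00"), each = 2),
  short_name = rep(c("AKBNK", "THYAO"), times = 2),
  price      = c(6.95, 15.79, 7.06, 16.01)
)

# One row per hour, one column per stock
wide <- dcast(toy, timestamp ~ short_name, value.var = "price")
print(wide)
```

In the wide view each stock occupies its own column, which is exactly the separation we will later enforce by modeling each stock independently.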
Let’s start with initialization.
First, we loaded the data into the R environment to complete the parts of the question. We also loaded the libraries that provide the visualizations used to interpret the data.
library(data.table)
library(quantmod) # for Yahoo Finance library
library(ggplot2)
library(dplyr)
library(lubridate)
library(plotly)
library(rpart.plot)
library(skimr)
library(GGally)
library(caret) # for machine learning and predictive modeling
library(rpart) # for rpart
library(gbm) # for Boosted Decision Trees
library(randomForest) # for random forests
library(reshape2) # for graphs
library(FNN)
library(class)
library(rattle)
library(pROC)
library(pdp)
library(MLmetrics)
library(magick)
library(forecast)
library(cluster)
library(factoextra)
library(Metrics)
Note that the path is set to a folder on my computer.
Please change it to run the commands properly.
As can be seen, there are 25 csv files containing the historical prices of 30 stocks from 2018 to 2023. The script below merges them into a single data frame.
path <- "C:/Users/anil.turgut/Desktop/582Project/Data"
# Get a list of CSV files in the directory
files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
files
## [1] "C:/Users/anil.turgut/Desktop/582Project/Data/20180101_20180401_bist30.csv"
## [2] "C:/Users/anil.turgut/Desktop/582Project/Data/20180402_20180701_bist30.csv"
## [3] "C:/Users/anil.turgut/Desktop/582Project/Data/20180702_20180930_bist30.csv"
## [4] "C:/Users/anil.turgut/Desktop/582Project/Data/20181001_20181230_bist30.csv"
## [5] "C:/Users/anil.turgut/Desktop/582Project/Data/20181231_20190331_bist30.csv"
## [6] "C:/Users/anil.turgut/Desktop/582Project/Data/20190401_20190630_bist30.csv"
## [7] "C:/Users/anil.turgut/Desktop/582Project/Data/20190701_20190929_bist30.csv"
## [8] "C:/Users/anil.turgut/Desktop/582Project/Data/20190930_20191229_bist30.csv"
## [9] "C:/Users/anil.turgut/Desktop/582Project/Data/20191230_20200329_bist30.csv"
## [10] "C:/Users/anil.turgut/Desktop/582Project/Data/20200330_20200628_bist30.csv"
## [11] "C:/Users/anil.turgut/Desktop/582Project/Data/20200629_20200927_bist30.csv"
## [12] "C:/Users/anil.turgut/Desktop/582Project/Data/20200928_20201227_bist30.csv"
## [13] "C:/Users/anil.turgut/Desktop/582Project/Data/20201228_20210328_bist30.csv"
## [14] "C:/Users/anil.turgut/Desktop/582Project/Data/20210329_20210627_bist30.csv"
## [15] "C:/Users/anil.turgut/Desktop/582Project/Data/20210628_20210926_bist30.csv"
## [16] "C:/Users/anil.turgut/Desktop/582Project/Data/20210927_20211226_bist30.csv"
## [17] "C:/Users/anil.turgut/Desktop/582Project/Data/20211227_20220327_bist30.csv"
## [18] "C:/Users/anil.turgut/Desktop/582Project/Data/20220328_20220626_bist30.csv"
## [19] "C:/Users/anil.turgut/Desktop/582Project/Data/20220627_20220925_bist30.csv"
## [20] "C:/Users/anil.turgut/Desktop/582Project/Data/20220926_20221225_bist30.csv"
## [21] "C:/Users/anil.turgut/Desktop/582Project/Data/20221226_20230326_bist30.csv"
## [22] "C:/Users/anil.turgut/Desktop/582Project/Data/20230327_20230625_bist30.csv"
## [23] "C:/Users/anil.turgut/Desktop/582Project/Data/20230626_20230924_bist30.csv"
## [24] "C:/Users/anil.turgut/Desktop/582Project/Data/20230925_20231120_bist30.csv"
## [25] "C:/Users/anil.turgut/Desktop/582Project/Data/20231121_20231223_bist30.csv"
# Create an empty data frame to store the merged data
data <- data.frame()
# Loop through each file and merge it into the 'data' data frame
for (file in files) {
# Read the CSV file
current_data <- read.csv(file, header = TRUE)
# Merge the current data with the existing merged data
data <- bind_rows(data, current_data)
}
head(data)
## timestamp price short_name
## 1 2018-01-02 09:00:00+03:00 15.79 THYAO
## 2 2018-01-02 10:00:00+03:00 16.01 THYAO
## 3 2018-01-02 11:00:00+03:00 16.05 THYAO
## 4 2018-01-02 12:00:00+03:00 16.05 THYAO
## 5 2018-01-02 13:00:00+03:00 16.06 THYAO
## 6 2018-01-02 14:00:00+03:00 16.05 THYAO
dim(data)
## [1] 446181 3
We have successfully loaded the data as a single data frame; now we move on to the Feature Engineering part.
Our dataset has three features: timestamp, short_name and price. Since this information is not sufficient to build strong learning models, we need to derive new features from the existing ones or from another source.
The script below also applies some data modifications that will help us in further analysis and modeling.
# Convert timestamp to POSIXct to use it as a time-series feature
# (the trailing "+03:00" offset is not parsed by this format string)
data$timestamp <- as.POSIXct(data$timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(data$timestamp)
## [1] "POSIXct" "POSIXt"
# Create a new feature which will be used as a key while merging further information
data$time <- as.Date(format(data$timestamp, "%Y/%m/%d"))
# Extract year, month, day, and hour
data$year <- year(data$timestamp)
data$month <- month(data$timestamp)
data$day <- day(data$timestamp)
data$hour <- hour(data$timestamp)
Below we add a new feature called “forecast_ma_6”, the moving-average forecast computed over the previous 6 records of each stock’s price. If a record does not have 6 previous records, its forecast is set to its original price. We expect this new feature to contribute to the learning algorithms.
stock_symbols <- c(
"THYAO", "AKBNK", "ARCLK", "ASELS", "BIMAS", "DOHOL", "EKGYO", "EREGL", "FROTO", "GUBRF",
"GARAN", "KRDMD", "KCHOL", "KOZAL", "KOZAA", "PGSUS", "PETKM", "SAHOL", "SASA", "SISE",
"TAVHL", "TKFEN", "TUPRS", "TTKOM", "TCELL", "HALKB", "ISCTR", "VAKBN", "VESTL", "YKBNK"
)
add_forecast_column <- function(data) {
data <- data %>%
arrange(time) %>%
mutate(`forecast_ma_6` = ifelse(row_number() < 7, price,
(lag(price, 1) + lag(price, 2) + lag(price, 3) +
lag(price, 4) + lag(price, 5) + lag(price, 6)) / 6))
return(data)
}
# Apply the function for each stock symbol
result_list <- lapply(stock_symbols, function(symbol) {
symbol_data <- data %>% filter(short_name == symbol)
symbol_data <- add_forecast_column(symbol_data)
return(symbol_data)
})
# Combine the results into a single data frame
data <- bind_rows(result_list)
# Moving price column to the last place of dataframe.
data <- data[, c(setdiff(names(data), "price"), "price")]
head(data,8)
## timestamp short_name time year month day hour forecast_ma_6
## 1 2018-01-02 09:00:00 THYAO 2018-01-02 2018 1 2 9 15.79000
## 2 2018-01-02 10:00:00 THYAO 2018-01-02 2018 1 2 10 16.01000
## 3 2018-01-02 11:00:00 THYAO 2018-01-02 2018 1 2 11 16.05000
## 4 2018-01-02 12:00:00 THYAO 2018-01-02 2018 1 2 12 16.05000
## 5 2018-01-02 13:00:00 THYAO 2018-01-02 2018 1 2 13 16.06000
## 6 2018-01-02 14:00:00 THYAO 2018-01-02 2018 1 2 14 16.05000
## 7 2018-01-02 15:00:00 THYAO 2018-01-02 2018 1 2 15 16.00167
## 8 2018-01-02 16:00:00 THYAO 2018-01-02 2018 1 2 16 16.05500
## price
## 1 15.79
## 2 16.01
## 3 16.05
## 4 16.05
## 5 16.06
## 6 16.05
## 7 16.11
## 8 16.13
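For reference, the lagged six-record average computed by the dplyr pipeline above can be written as a small base-R helper (a sketch, not the code used in this report; as above, the first six records fall back to the raw price):

```r
# Lagged 6-period moving average: for each point from row 7 onward,
# take the mean of the 6 observations preceding it
ma6_forecast <- function(price) {
  n <- length(price)
  out <- price                     # default: the original price
  if (n >= 7) {
    for (i in 7:n) out[i] <- mean(price[(i - 6):(i - 1)])
  }
  out
}

# On the first seven THYAO prices this reproduces the 16.00167 shown above
ma6_forecast(c(15.79, 16.01, 16.05, 16.05, 16.06, 16.05, 16.11))
```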
We are going to use the quantmod library to extract
related-meaningful information about our 30 stocks such as volume of the
stock for a day. We will analyze whether these extracted features
contribute our models or not.
Basically, quantmod is an R package that provides a
framework for quantitative financial modeling and trading. It provides a
rapid prototyping environment that makes modeling easier by removing the
repetitive workflow issues surrounding data management and
visualization.
Again, for the specified 30 stocks and the given start_date and end_date (end_date must be t + 1, since quantmod treats the right-hand side as exclusive), we extract information about the stocks such as open and close prices, volume, etc.
Also, since the quantmod library does not allow fetching hourly data for ranges longer than 7 days, we proceed with the daily volume data of the stocks.
Note that end_date must be updated after new data have been obtained.
# Specify the start and end dates
start_date <- as.Date("2018-01-02")
end_date <- as.Date("2023-11-22")
# Create an empty data frame to store the data
all_stock_data <- data.frame()
# Loop through each stock symbol and retrieve data
for (stock_symbol in stock_symbols) {
# Append ".IS" to the stock symbol
full_symbol <- paste0(stock_symbol, ".IS")
# Download stock data
getSymbols(full_symbol, src = "yahoo", from = start_date, to = end_date)
# Convert the retrieved series; Yahoo returns daily bars for this range,
# so to.hourly() effectively keeps daily granularity here
hourly_data <- to.hourly(get(full_symbol))
# Append the data to the 'all_stock_data' data frame
all_stock_data <- rbind(all_stock_data, data.frame(Symbol = stock_symbol, hourly_data))
}
# Remove the prefix "get.full_symbol.." from column names
cleaned_colnames <- gsub("^get.*\\.", "", colnames(all_stock_data))
colnames(all_stock_data) <- cleaned_colnames
all_stock_data$Date <- as.Date(rownames(all_stock_data))
# Print the first few rows of the combined data
head(all_stock_data)
## Symbol Open High Low Close Volume Adjusted Date
## 2018-01-02 THYAO 15.79 16.18 15.67 16.08 44253261 16.08 2018-01-02
## 2018-01-03 THYAO 16.05 16.23 15.92 16.20 40100544 16.20 2018-01-03
## 2018-01-04 THYAO 16.20 16.51 16.18 16.29 54106099 16.29 2018-01-04
## 2018-01-05 THYAO 16.44 16.55 16.18 16.33 55516943 16.33 2018-01-05
## 2018-01-08 THYAO 16.53 16.55 16.29 16.29 28138834 16.29 2018-01-08
## 2018-01-09 THYAO 16.29 16.45 16.10 16.10 34728669 16.10 2018-01-09
Now we have obtained the 30 stocks’ information between the start and end dates. Since we have 10-hour price information per day, we can use the volume information as an additional feature.
To add the volume information to our main data, we use the merge() function; even with this many records it is quite efficient. We merge on the stock name and the date (e.g., 2022-11-10), and we drop some features, such as time, from the merged final data frame.
Moreover, we do not want missing or anomalous volume values in our data frame. As seen below, there are luckily no missing values. However, all stocks’ volumes are 0 on one specific day (the day COVID reached Turkey), so we eliminate the records whose volume is 0 to prevent bias.
# Merge the two dataframes based on the 'short_name' and 'Date' columns
merged_data <- merge(data, all_stock_data[, c("Symbol", "Volume", "Date")],
by.x = c("short_name", "time"), by.y = c("Symbol", "Date"), all.x = TRUE)
# Rename the merged 'Volume' column to 'volume' in the 'data' dataframe
merged_data <- rename(merged_data, volume = Volume)
merged_data <- merged_data %>% arrange(short_name, timestamp)
final_data <- merged_data %>%
select(timestamp, short_name, year, month, day, hour, volume,"forecast_ma_6", price)
head(final_data)
## timestamp short_name year month day hour volume forecast_ma_6
## 1 2018-01-02 09:00:00 AKBNK 2018 1 2 9 22346109 6.9475
## 2 2018-01-02 10:00:00 AKBNK 2018 1 2 10 22346109 7.0602
## 3 2018-01-02 11:00:00 AKBNK 2018 1 2 11 22346109 7.0954
## 4 2018-01-02 12:00:00 AKBNK 2018 1 2 12 22346109 7.0814
## 5 2018-01-02 13:00:00 AKBNK 2018 1 2 13 22346109 7.1024
## 6 2018-01-02 14:00:00 AKBNK 2018 1 2 14 22346109 7.1377
## price
## 1 6.9475
## 2 7.0602
## 3 7.0954
## 4 7.0814
## 5 7.1024
## 6 7.1377
colSums(sapply(all_stock_data, is.na))
## Symbol Open High Low Close Volume Adjusted Date
## 0 0 0 0 0 0 0 0
final_data <- final_data %>%
filter(volume != 0)
head(final_data,8)
## timestamp short_name year month day hour volume forecast_ma_6
## 1 2018-01-02 09:00:00 AKBNK 2018 1 2 9 22346109 6.947500
## 2 2018-01-02 10:00:00 AKBNK 2018 1 2 10 22346109 7.060200
## 3 2018-01-02 11:00:00 AKBNK 2018 1 2 11 22346109 7.095400
## 4 2018-01-02 12:00:00 AKBNK 2018 1 2 12 22346109 7.081400
## 5 2018-01-02 13:00:00 AKBNK 2018 1 2 13 22346109 7.102400
## 6 2018-01-02 14:00:00 AKBNK 2018 1 2 14 22346109 7.137700
## 7 2018-01-02 15:00:00 AKBNK 2018 1 2 15 22346109 7.070767
## 8 2018-01-02 16:00:00 AKBNK 2018 1 2 16 22346109 7.100100
## price
## 1 6.9475
## 2 7.0602
## 3 7.0954
## 4 7.0814
## 5 7.1024
## 6 7.1377
## 7 7.1235
## 8 7.1235
colSums(sapply(final_data, is.na))
## timestamp short_name year month day
## 0 0 0 0 0
## hour volume forecast_ma_6 price
## 0 0 0 0
volume has been successfully added to the final data frame, and the
year, month, day and hour fields have been extracted from the timestamp
feature. There are no missing values, so we are ready to move on.
In this part, we create 30 separate data frames from the main one, since the analysis and modelling of each stock is performed independently of the other stocks. We will therefore work with these stocks separately.
thyao_data <- final_data %>% filter(short_name == "THYAO")
akbnk_data <- final_data %>% filter(short_name == "AKBNK")
arclk_data <- final_data %>% filter(short_name == "ARCLK")
asels_data <- final_data %>% filter(short_name == "ASELS")
bimas_data <- final_data %>% filter(short_name == "BIMAS")
dohol_data <- final_data %>% filter(short_name == "DOHOL")
ekgyo_data <- final_data %>% filter(short_name == "EKGYO")
eregl_data <- final_data %>% filter(short_name == "EREGL")
froto_data <- final_data %>% filter(short_name == "FROTO")
gubrf_data <- final_data %>% filter(short_name == "GUBRF")
garan_data <- final_data %>% filter(short_name == "GARAN")
krdmd_data <- final_data %>% filter(short_name == "KRDMD")
kchol_data <- final_data %>% filter(short_name == "KCHOL")
kozal_data <- final_data %>% filter(short_name == "KOZAL")
kozaa_data <- final_data %>% filter(short_name == "KOZAA")
pgsus_data <- final_data %>% filter(short_name == "PGSUS")
petkm_data <- final_data %>% filter(short_name == "PETKM")
sahol_data <- final_data %>% filter(short_name == "SAHOL")
sasa_data <- final_data %>% filter(short_name == "SASA")
sise_data <- final_data %>% filter(short_name == "SISE")
tavhl_data <- final_data %>% filter(short_name == "TAVHL")
tkfen_data <- final_data %>% filter(short_name == "TKFEN")
tuprs_data <- final_data %>% filter(short_name == "TUPRS")
ttkom_data <- final_data %>% filter(short_name == "TTKOM")
tcell_data <- final_data %>% filter(short_name == "TCELL")
halkb_data <- final_data %>% filter(short_name == "HALKB")
isctr_data <- final_data %>% filter(short_name == "ISCTR")
vakbn_data <- final_data %>% filter(short_name == "VAKBN")
vestl_data <- final_data %>% filter(short_name == "VESTL")
ykbnk_data <- final_data %>% filter(short_name == "YKBNK")
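The thirty filter() calls above could also be generated in one step with base R’s split(), which returns a named list of per-stock data frames (a sketch on toy data; stock_list$THYAO would then play the role of thyao_data):

```r
# Hypothetical toy rows standing in for final_data
toy <- data.frame(
  short_name = c("THYAO", "AKBNK", "THYAO"),
  price      = c(16.05, 7.06, 16.11)
)

# Named list: one data frame per distinct short_name
stock_list <- split(toy, toy$short_name)
names(stock_list)
```

The explicit per-stock variables above are kept for readability; the list form is handy when looping the same model over all 30 stocks.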
In this part, we explore and analyze the data further using summaries, aggregate functions and visualizations.
The data contains one POSIXct feature, one character feature, and the rest are int or num, which is convenient. Judging by the summary statistics, there is no obvious anomaly or extreme value in the data.
Let’s also plot some visualizations to understand and demonstrate the data better. A boxplot of the stock price distribution and the average price by month and year are shown below.
dim(final_data)
## [1] 439007 9
str(final_data)
## 'data.frame': 439007 obs. of 9 variables:
## $ timestamp : POSIXct, format: "2018-01-02 09:00:00" "2018-01-02 10:00:00" ...
## $ short_name : chr "AKBNK" "AKBNK" "AKBNK" "AKBNK" ...
## $ year : num 2018 2018 2018 2018 2018 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 2 2 2 2 2 2 2 2 2 2 ...
## $ hour : int 9 10 11 12 13 14 15 16 17 18 ...
## $ volume : num 22346109 22346109 22346109 22346109 22346109 ...
## $ forecast_ma_6: num 6.95 7.06 7.1 7.08 7.1 ...
## $ price : num 6.95 7.06 7.1 7.08 7.1 ...
summary(final_data)
## timestamp short_name year
## Min. :2018-01-02 09:00:00.00 Length:439007 Min. :2018
## 1st Qu.:2019-06-21 12:00:00.00 Class :character 1st Qu.:2019
## Median :2020-12-10 17:00:00.00 Mode :character Median :2020
## Mean :2020-12-11 03:22:19.56 Mean :2020
## 3rd Qu.:2022-06-01 16:00:00.00 3rd Qu.:2022
## Max. :2023-11-21 18:00:00.00 Max. :2023
## month day hour volume
## Min. : 1.000 Min. : 1.00 Min. : 9.00 Min. :9.000e+00
## 1st Qu.: 3.000 1st Qu.: 8.00 1st Qu.:11.00 1st Qu.:7.063e+06
## Median : 6.000 Median :16.00 Median :13.00 Median :3.042e+07
## Mean : 6.452 Mean :15.63 Mean :13.49 Mean :6.474e+07
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:16.00 3rd Qu.:8.191e+07
## Max. :12.000 Max. :31.00 Max. :18.00 Max. :1.073e+09
## forecast_ma_6 price
## Min. : 0.5844 Min. : 0.5817
## 1st Qu.: 5.0217 1st Qu.: 5.0229
## Median : 11.2297 Median : 11.2318
## Mean : 32.0485 Mean : 32.0766
## 3rd Qu.: 26.4633 3rd Qu.: 26.4800
## Max. :961.7500 Max. :975.0000
ggplot(final_data, aes(x = short_name, y = price, fill = short_name)) +
geom_boxplot() +
labs(title = "Stock Price Distribution",
x = "Stock",
y = "Price") +
theme_minimal() +
theme(legend.position = "none")
average_price_by_month <- aggregate(price ~ month + year, data = final_data, mean)
ggplot(average_price_by_month, aes(x = month, y = price, fill = factor(year))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Stock Price by Month",
x = "Month",
y = "Average Price") +
theme_minimal() +
theme(legend.position = "top")
We wrote a function to detect anomalies in the plots: a value is flagged as an anomaly if it falls below mean(data) − 3 · sd(data) (the upper side could be checked symmetrically). Five example stock price plots are shown below; none of them contains any anomalous values (the remaining stocks were also checked). This is a good sign for building sound models.
plot_anomaly <- function(df){
  # Flag prices that fall below mean - 3 * sd (adjust the multiplier as needed)
  anomaly_threshold_price <- mean(df$price) - 3 * sd(df$price)
  anomaly_values_price <- df$price < anomaly_threshold_price
  stock_title <- paste(df$short_name[1], "Stock Prices Over Time with Anomalies", sep = " ")
  ggplot(df, aes(x = timestamp, y = price)) +
    geom_line() +
    geom_point(data = df[anomaly_values_price, ], aes(color = "Anomaly"), size = 3) +
    labs(x = "Datetime", y = "Stock Closing Prices", title = stock_title) +
    scale_color_manual(values = c("Anomaly" = "red")) +
    theme_minimal()
}
plot_anomaly(thyao_data)
plot_anomaly(akbnk_data)
plot_anomaly(vestl_data)
plot_anomaly(froto_data)
plot_anomaly(sise_data)
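The 3-sigma rule itself is easy to verify on toy numbers (a hypothetical series, checking both sides of the band here, while the plotting function above checks only the lower bound):

```r
# Hypothetical flat series with one obvious spike at the end
x <- c(rep(10, 50), 50)

# Values outside mean +- 3 * sd are flagged
bounds  <- mean(x) + c(-3, 3) * sd(x)
flagged <- which(x < bounds[1] | x > bounds[2])
flagged  # only the spike at position 51
```

Note that a single large outlier inflates sd(x) itself, so the rule works best when anomalies are rare relative to the series length, as in this toy case.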
Finally, using the quantmod library and its functionality, let’s look at visualizations of example stocks such as THYAO.IS and SISE.IS. These candlestick and line plots are also helpful for further analysis, since they give insight into how current volume relates to the future stock price.
All stock closing prices are also plotted in a single chart below. Some stocks’ prices have multiplied many times over the period, which supports our idea that splitting the dataset by stock is a good way to block any cross-stock effects.
# Plot candlestick chart for one stock (e.g., THYAO, SISE)
thYaoData <- getSymbols("THYAO.IS", src = "yahoo", from = "2018-01-02", to = "2023-11-22", auto.assign = FALSE)
candleChart(thYaoData, theme = "white")
siseData <- getSymbols("SISE.IS", src = "yahoo", from = "2018-01-02", to = "2023-11-22", auto.assign = FALSE)
candleChart(siseData, theme = "white")
ggplot(all_stock_data, aes(x = Date, y = Close, color = Symbol)) +
geom_line() +
labs(title = "Closing Prices Over Time for Selected Stocks",
x = "Date",
y = "Closing Price",
color = "Stock Symbol") +
theme_minimal()
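Because price levels differ by orders of magnitude across stocks, an indexed view (each series rescaled to 100 at its first observation) would make the single-chart comparison fairer; a base-R sketch on toy values:

```r
# Toy closing prices for two hypothetical symbols with very different levels
toy <- data.frame(
  Symbol = rep(c("A", "B"), each = 3),
  Close  = c(10, 11, 12, 100, 90, 110)
)

# Rescale each symbol's series so that its first close equals 100
toy$index <- with(toy, ave(Close, Symbol, FUN = function(p) 100 * p / p[1]))
toy$index
```

Plotting `index` instead of `Close` in the ggplot above would put every stock on the same starting point, so relative moves become directly comparable.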
Let’s also analyze the data with clustering methods; this will help us see which stocks are similar to each other in terms of distance.
Clustering is the task of dividing the data points into groups such that points in the same group are more similar to one another than to points in other groups; it groups objects based on the similarity and dissimilarity between them.
For this dataset, it can give insights about which stocks behave similarly.
Let’s start with k means clustering :
cluster_data <- final_data
features <- cluster_data[, c("volume", "forecast_ma_6", "price")]
# Scale the features
scaled_features <- scale(features)
# Choose the number of clusters (k)
k <- 4
# Apply k-means clustering
kmeans_result <- kmeans(scaled_features, centers = k, nstart = 25)
# Add the cluster labels to your data
cluster_data$cluster <- as.factor(kmeans_result$cluster)
head(cluster_data)
## timestamp short_name year month day hour volume forecast_ma_6
## 1 2018-01-02 09:00:00 AKBNK 2018 1 2 9 22346109 6.9475
## 2 2018-01-02 10:00:00 AKBNK 2018 1 2 10 22346109 7.0602
## 3 2018-01-02 11:00:00 AKBNK 2018 1 2 11 22346109 7.0954
## 4 2018-01-02 12:00:00 AKBNK 2018 1 2 12 22346109 7.0814
## 5 2018-01-02 13:00:00 AKBNK 2018 1 2 13 22346109 7.1024
## 6 2018-01-02 14:00:00 AKBNK 2018 1 2 14 22346109 7.1377
## price cluster
## 1 6.9475 4
## 2 7.0602 4
## 3 7.0954 4
## 4 7.0814 4
## 5 7.1024 4
## 6 7.1377 4
# Scatter plot in 3D
plot_ly(cluster_data, x = ~volume, y = ~`forecast_ma_6`, z = ~price, color = ~cluster,
text = ~short_name, # This adds stock names as labels
type = "scatter3d", mode = "markers", marker = list(size = 5)) %>%
layout(scene = list(xaxis = list(title = "Volume"),
yaxis = list(title = "forecast_ma_6"),
zaxis = list(title = "Price"),
margin = list(l = 0, r = 0, b = 0, t = 0)))